{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Reshaping\n", "\n", "Antes de analizar los datos, se necesita formar los datos obtenidos en un formato regular y que sea procesable por el algoritumo que luego utilizaremos. Es necesario asegurar que todos los datos correspondan con las variables. También, es necesario lidiar con los valores nulos, si es que hubiese. En términos generales, se puede decir que Data Reshaping es cambiar la manera en que los datos están organizados en coumnas y filas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Join](#1)\n", "2. [Union](#2)\n", "3. [Stack, Unstack](#3)\n", "4. [Pivot](#4)\n", "5. [Melt](#5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join\n", "\n", "\n", "\n", "Join, o merge, es el proceso de unir dos DataFrame diferentes en uno solo. Por ejemplo, si tenemos dos DataFrames que contienen diferente información pero sobre los mismos clientes, podemos unirlos en uno solo.\n", "\n", "Aquí algunos enlaces extras:\n", "\n", "__[pandas.merge](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.merge.html)__\n", "\n", "__[Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 1](https://thispointer.com/pandas-how-to-merge-dataframes-using-dataframe-merge-in-python-part-1/)__\n", "\n", "__[Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 2](https://thispointer.com/pandas-merge-dataframes-on-specific-columns-or-on-index-in-python-part-2/)__\n", "\n", "__[Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 3](https://thispointer.com/pandas-how-to-merge-dataframes-by-index-using-dataframe-merge-part-3/)__" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import os" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDGenderAgeAnnual_Income
146147Male4877
147148Female3277
148149Female3478
149150Male3478
150151Female4880
\n", "
" ], "text/plain": [ " CustomerID Gender Age Annual_Income\n", "146 147 Male 48 77\n", "147 148 Female 32 77\n", "148 149 Female 34 78\n", "149 150 Male 34 78\n", "150 151 Female 48 80" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mall_customers_info = pd.read_csv(os.path.join(\"csv\", \"mall_customers_info.csv\"))\n", "mall_customers_info.tail()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(151, 4)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mall_customers_info.shape" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDSpending Score (1-100)
14614736
14714874
14814922
14915090
15015295
\n", "
" ], "text/plain": [ " CustomerID Spending Score (1-100)\n", "146 147 36\n", "147 148 74\n", "148 149 22\n", "149 150 90\n", "150 152 95" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mall_customers_score = pd.read_csv(os.path.join(\"csv\", \"mall_customers_score.csv\"))\n", "mall_customers_score.tail()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(151, 2)" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mall_customers_score.shape" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDGenderAgeAnnual_IncomeSpending Score (1-100)
145146Male287797
146147Male487736
147148Female327774
148149Female347822
149150Male347890
\n", "
" ], "text/plain": [ " CustomerID Gender Age Annual_Income Spending Score (1-100)\n", "145 146 Male 28 77 97\n", "146 147 Male 48 77 36\n", "147 148 Female 32 77 74\n", "148 149 Female 34 78 22\n", "149 150 Male 34 78 90" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#customer_data = pd.merge(mall_customers_info[['CustomerID','Gender','Annual_Income']], mall_customers_score, on='CustomerID')\n", "customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID')\n", "customer_data.tail()\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 5)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customer_data.shape" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDGenderAgeAnnual_IncomeSpending Score (1-100)
146147Male487736.0
147148Female327774.0
148149Female347822.0
149150Male347890.0
150151Female4880NaN
\n", "
" ], "text/plain": [ " CustomerID Gender Age Annual_Income Spending Score (1-100)\n", "146 147 Male 48 77 36.0\n", "147 148 Female 32 77 74.0\n", "148 149 Female 34 78 22.0\n", "149 150 Male 34 78 90.0\n", "150 151 Female 48 80 NaN" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID', how='left')\n", "customer_data.tail()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDGenderAgeAnnual_IncomeSpending Score (1-100)
146147Male48.077.036
147148Female32.077.074
148149Female34.078.022
149150Male34.078.090
150152NaNNaNNaN95
\n", "
" ], "text/plain": [ " CustomerID Gender Age Annual_Income Spending Score (1-100)\n", "146 147 Male 48.0 77.0 36\n", "147 148 Female 32.0 77.0 74\n", "148 149 Female 34.0 78.0 22\n", "149 150 Male 34.0 78.0 90\n", "150 152 NaN NaN NaN 95" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID', how='right')\n", "customer_data.tail()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDGenderAgeAnnual_IncomeSpending Score (1-100)
147148Female32.077.074.0
148149Female34.078.022.0
149150Male34.078.090.0
150151Female48.080.0NaN
151152NaNNaNNaN95.0
\n", "
" ], "text/plain": [ " CustomerID Gender Age Annual_Income Spending Score (1-100)\n", "147 148 Female 32.0 77.0 74.0\n", "148 149 Female 34.0 78.0 22.0\n", "149 150 Male 34.0 78.0 90.0\n", "150 151 Female 48.0 80.0 NaN\n", "151 152 NaN NaN NaN 95.0" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID', how='outer')\n", "customer_data.tail()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDGenderAgeAnnual_IncomeSpending Score (1-100)
145146Male287797
146147Male487736
147148Female327774
148149Female347822
149150Male347890
\n", "
" ], "text/plain": [ " CustomerID Gender Age Annual_Income Spending Score (1-100)\n", "145 146 Male 28 77 97\n", "146 147 Male 48 77 36\n", "147 148 Female 32 77 74\n", "148 149 Female 34 78 22\n", "149 150 Male 34 78 90" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID', how='inner')\n", "customer_data.tail()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 5)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customer_data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Union\n", "\n", "\n", "\n", "__[pandas.concat](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.concat.html)__" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDGenderAgeAnnual_IncomeSpending Score (1-100)
0151Male437817
1152Male397888
2153Female447820
3154Female387876
4155Female477816
\n", "
" ], "text/plain": [ " CustomerID Gender Age Annual_Income Spending Score (1-100)\n", "0 151 Male 43 78 17\n", "1 152 Male 39 78 88\n", "2 153 Female 44 78 20\n", "3 154 Female 38 78 76\n", "4 155 Female 47 78 16" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mall_customers_more = pd.read_csv(os.path.join(\"csv\", \"customers_data_2.csv\"))\n", "mall_customers_more.head()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(50, 5)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mall_customers_more.shape" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDGenderAgeAnnual_IncomeSpending Score (1-100)
40191Female3410323
41192Female3210369
42193Male331138
43194Female3811391
44195Female4712016
45196Female3512079
46197Female4511528
47198Male3211574
48199Male3217070
49200Male3016573
\n", "
" ], "text/plain": [ " CustomerID Gender Age Annual_Income Spending Score (1-100)\n", "40 191 Female 34 103 23\n", "41 192 Female 32 103 69\n", "42 193 Male 33 113 8\n", "43 194 Female 38 113 91\n", "44 195 Female 47 120 16\n", "45 196 Female 35 120 79\n", "46 197 Female 45 115 28\n", "47 198 Male 32 115 74\n", "48 199 Male 32 170 70\n", "49 200 Male 30 165 73" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers_data_all = pd.concat([customer_data, mall_customers_more])\n", "#customers_data_all.sample(10)\n", "customers_data_all.tail(10)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(200, 5)" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers_data_all.shape" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "customers_data_all.reset_index(inplace=True, drop=True)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDGenderAgeAnnual_IncomeSpending Score (1-100)
190191Female3410323
191192Female3210369
192193Male331138
193194Female3811391
194195Female4712016
195196Female3512079
196197Female4511528
197198Male3211574
198199Male3217070
199200Male3016573
\n", "
" ], "text/plain": [ " CustomerID Gender Age Annual_Income Spending Score (1-100)\n", "190 191 Female 34 103 23\n", "191 192 Female 32 103 69\n", "192 193 Male 33 113 8\n", "193 194 Female 38 113 91\n", "194 195 Female 47 120 16\n", "195 196 Female 35 120 79\n", "196 197 Female 45 115 28\n", "197 198 Male 32 115 74\n", "198 199 Male 32 170 70\n", "199 200 Male 30 165 73" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#customers_data_all.sample(10)\n", "customers_data_all.tail(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stack, Unstack\n", "\n", "\n", "\n", "__[Reshape using Stack() and unstack() function in Pandas python](http://www.datasciencemadesimple.com/reshape-using-stack-unstack-function-pandas-python/)__" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YYYYJANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDECYEAR
0200810140102391005010111101591015910141101171017810148101251018210146
1200910137101401014010141101881016810128101651020810166100411006810141
2201010151100341016810194101581016610158101291014710135100571013310136
3201110182101611022710192101821015410123101301014910182101941009910165
4201210194102861027110053101591012710139101551014910109101081008510153
\n", "
" ], "text/plain": [ " YYYY JAN FEB MAR APR MAY JUN JUL AUG SEP OCT \\\n", "0 2008 10140 10239 10050 10111 10159 10159 10141 10117 10178 10148 \n", "1 2009 10137 10140 10140 10141 10188 10168 10128 10165 10208 10166 \n", "2 2010 10151 10034 10168 10194 10158 10166 10158 10129 10147 10135 \n", "3 2011 10182 10161 10227 10192 10182 10154 10123 10130 10149 10182 \n", "4 2012 10194 10286 10271 10053 10159 10127 10139 10155 10149 10109 \n", "\n", " NOV DEC YEAR \n", "0 10125 10182 10146 \n", "1 10041 10068 10141 \n", "2 10057 10133 10136 \n", "3 10194 10099 10165 \n", "4 10108 10085 10153 " ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datos_mensuales = pd.read_csv('./csv/monthly_data.csv')\n", "datos_mensuales.head(5)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDECYEAR
YYYY
200810140102391005010111101591015910141101171017810148101251018210146
200910137101401014010141101881016810128101651020810166100411006810141
201010151100341016810194101581016610158101291014710135100571013310136
201110182101611022710192101821015410123101301014910182101941009910165
201210194102861027110053101591012710139101551014910109101081008510153
\n", "
" ], "text/plain": [ " JAN FEB MAR APR MAY JUN JUL AUG SEP OCT \\\n", "YYYY \n", "2008 10140 10239 10050 10111 10159 10159 10141 10117 10178 10148 \n", "2009 10137 10140 10140 10141 10188 10168 10128 10165 10208 10166 \n", "2010 10151 10034 10168 10194 10158 10166 10158 10129 10147 10135 \n", "2011 10182 10161 10227 10192 10182 10154 10123 10130 10149 10182 \n", "2012 10194 10286 10271 10053 10159 10127 10139 10155 10149 10109 \n", "\n", " NOV DEC YEAR \n", "YYYY \n", "2008 10125 10182 10146 \n", "2009 10041 10068 10141 \n", "2010 10057 10133 10136 \n", "2011 10194 10099 10165 \n", "2012 10108 10085 10153 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Preparacion: usar 'YYYY' como el ID/Indice\n", "datos_mensuales.set_index('YYYY', inplace=True)\n", "datos_mensuales.head(5)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "YYYY \n", "2008 JAN 10140\n", " FEB 10239\n", " MAR 10050\n", " APR 10111\n", " MAY 10159\n", " ... \n", "2017 SEP 10135\n", " OCT 10176\n", " NOV 10141\n", " DEC 10120\n", " YEAR 10160\n", "Length: 130, dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#En valor en cada columna se transforma en una fila\n", "datos_mensuales.stack()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 11538 entries, 0 to 11537\n", "Data columns (total 12 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 id 11538 non-null int64 \n", " 1 name 11538 non-null object \n", " 2 nationality 11538 non-null object \n", " 3 sex 11538 non-null object \n", " 4 date_of_birth 11538 non-null object \n", " 5 height 11208 non-null float64\n", " 6 weight 10879 non-null float64\n", " 7 sport 11538 non-null object \n", " 8 gold 11538 non-null int64 \n", " 9 silver 11538 non-null int64 \n", " 10 bronze 11538 non-null int64 \n", " 11 info 131 non-null object \n", "dtypes: float64(2), int64(4), object(6)\n", "memory usage: 1.1+ MB\n" ] } ], "source": [ "athletes = pd.read_csv('./csv/athletes.csv')\n", "athletes.info()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnamenationalitysexdate_of_birthheightweightsportgoldsilverbronzeinfo
0736041664A Jesus GarciaESPmale1969-10-171.7264.0athletics000NaN
1532037425A Lam ShinKORfemale1986-09-231.6856.0fencing000NaN
2435962603Aaron BrownCANmale1992-05-271.9879.0athletics001NaN
3521041435Aaron CookMDAmale1991-01-021.8380.0taekwondo000NaN
433922579Aaron GateNZLmale1990-11-261.8171.0cycling000NaN
\n", "
" ], "text/plain": [ " id name nationality sex date_of_birth height \\\n", "0 736041664 A Jesus Garcia ESP male 1969-10-17 1.72 \n", "1 532037425 A Lam Shin KOR female 1986-09-23 1.68 \n", "2 435962603 Aaron Brown CAN male 1992-05-27 1.98 \n", "3 521041435 Aaron Cook MDA male 1991-01-02 1.83 \n", "4 33922579 Aaron Gate NZL male 1990-11-26 1.81 \n", "\n", " weight sport gold silver bronze info \n", "0 64.0 athletics 0 0 0 NaN \n", "1 56.0 fencing 0 0 0 NaN \n", "2 79.0 athletics 0 0 1 NaN \n", "3 80.0 taekwondo 0 0 0 NaN \n", "4 71.0 cycling 0 0 0 NaN " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "athletes.head(5)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sport sex \n", "aquatics female 62.284483\n", " male 82.219061\n", "archery female 64.301587\n", " male 80.079365\n", "athletics female 60.152542\n", " male 74.777680\n", "badminton female 61.209877\n", " male 76.156627\n", "basketball female 75.377622\n", " male 100.297872\n", "boxing female NaN\n", " male NaN\n", "canoe female 66.457944\n", " male 82.150000\n", "cycling female 60.207254\n", " male 72.576052\n", "equestrian female 58.634146\n", " male 72.954887\n", "fencing female 62.733871\n", " male 78.785124\n", "football female 61.061069\n", " male 74.451713\n", "golf female 63.200000\n", " male 79.000000\n", "gymnastics female 49.555024\n", " male 63.254545\n", "handball female 70.789157\n", " male 95.431694\n", "hockey female 60.425926\n", " male 77.375000\n", "judo female 65.392157\n", " male 84.616740\n", "modern pentathlon female 58.000000\n", " male 73.916667\n", "rowing female 69.776190\n", " male 86.504615\n", "rugby sevens female 66.595890\n", " male 90.450331\n", "sailing female 63.194969\n", " male 77.122066\n", "shooting female 62.675676\n", " male 81.068966\n", "table tennis female 57.530120\n", " male 72.558140\n", "taekwondo female 61.258065\n", " male 74.809524\n", "tennis female 64.670455\n", " male 80.417476\n", "triathlon female 54.563636\n", " male 66.814815\n", "volleyball female 70.684211\n", " male 89.421875\n", "weightlifting female 68.788462\n", " male 87.538961\n", "wrestling female 61.805310\n", " male 85.372881\n", "Name: weight, dtype: float64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weight_mean_by_sport_and_sex = athletes.groupby(['sport', 'sex'])['weight'].mean()\n", "weight_mean_by_sport_and_sex" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexfemalemale
sport
aquatics62.28448382.219061
archery64.30158780.079365
athletics60.15254274.777680
badminton61.20987776.156627
basketball75.377622100.297872
boxingNaNNaN
canoe66.45794482.150000
cycling60.20725472.576052
equestrian58.63414672.954887
fencing62.73387178.785124
football61.06106974.451713
golf63.20000079.000000
gymnastics49.55502463.254545
handball70.78915795.431694
hockey60.42592677.375000
judo65.39215784.616740
modern pentathlon58.00000073.916667
rowing69.77619086.504615
rugby sevens66.59589090.450331
sailing63.19496977.122066
shooting62.67567681.068966
table tennis57.53012072.558140
taekwondo61.25806574.809524
tennis64.67045580.417476
triathlon54.56363666.814815
volleyball70.68421189.421875
weightlifting68.78846287.538961
wrestling61.80531085.372881
\n", "
" ], "text/plain": [ "sex female male\n", "sport \n", "aquatics 62.284483 82.219061\n", "archery 64.301587 80.079365\n", "athletics 60.152542 74.777680\n", "badminton 61.209877 76.156627\n", "basketball 75.377622 100.297872\n", "boxing NaN NaN\n", "canoe 66.457944 82.150000\n", "cycling 60.207254 72.576052\n", "equestrian 58.634146 72.954887\n", "fencing 62.733871 78.785124\n", "football 61.061069 74.451713\n", "golf 63.200000 79.000000\n", "gymnastics 49.555024 63.254545\n", "handball 70.789157 95.431694\n", "hockey 60.425926 77.375000\n", "judo 65.392157 84.616740\n", "modern pentathlon 58.000000 73.916667\n", "rowing 69.776190 86.504615\n", "rugby sevens 66.595890 90.450331\n", "sailing 63.194969 77.122066\n", "shooting 62.675676 81.068966\n", "table tennis 57.530120 72.558140\n", "taekwondo 61.258065 74.809524\n", "tennis 64.670455 80.417476\n", "triathlon 54.563636 66.814815\n", "volleyball 70.684211 89.421875\n", "weightlifting 68.788462 87.538961\n", "wrestling 61.805310 85.372881" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Mueve cada valor del ultimo nivel de un indice mutinivel a un columna\n", "weight_mean_by_sport_and_sex.unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pivot\n", "\n", "\n", "\n", "__[pandas.DataFrame.pivot](https://pandas.pydata.org/pandas-docs/version/0.22.0/generated/pandas.DataFrame.pivot.html)__" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iditemvalue
0823905prize3.49
1823905unitkg
2235897prize12.89
3235897unitl
4235897stock50
5983422prize0.49
6983422stock4
\n", "
" ], "text/plain": [ " id item value\n", "0 823905 prize 3.49\n", "1 823905 unit kg\n", "2 235897 prize 12.89\n", "3 235897 unit l\n", "4 235897 stock 50\n", "5 983422 prize 0.49\n", "6 983422 stock 4" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#El ID se repite para cada uno de las propiedaded del producto :/\n", "products = pd.DataFrame({'id': [823905, 823905,\n", " 235897, 235897, 235897,\n", " 983422, 983422],\n", " 'item': ['prize', 'unit', \n", " 'prize', 'unit', 'stock', \n", " 'prize', 'stock'],\n", " 'value': [3.49, 'kg',\n", " 12.89, 'l', 50,\n", " 0.49, 4]})\n", "products" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html\n", "\n", "De manera similar a unstack, sirve para mover fila como columnas\n", "y así construir un DF con más columnas y menos filas\n", "\n", "Usar id como el indice de cada fila y\n", "los valores en la columna 'item' para crear columnas.\n", "Los valores que debe aparecer en cada columnas esta en la columna 'value" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
itemprizestockunit
id
23589712.8950l
8239053.49NaNkg
9834220.494NaN
\n", "
" ], "text/plain": [ "item prize stock unit\n", "id \n", "235897 12.89 50 l\n", "823905 3.49 NaN kg\n", "983422 0.49 4 NaN" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products.pivot(index='id', columns='item', values='value')" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datesymbolopenhighlowclosevolume
02019-03-01AMZN1655.131674.261651.001671.734974877
12019-03-04AMZN1685.001709.431674.361696.176167358
22019-03-05AMZN1702.951707.801689.011692.433681522
32019-03-06AMZN1695.971697.751668.281668.953996001
42019-03-07AMZN1667.371669.751620.511625.954957017
52019-03-01AAPL174.28175.15172.89174.9725886167
62019-03-04AAPL175.69177.75173.97175.8527436203
72019-03-05AAPL175.94176.00174.54175.5319737419
82019-03-06AAPL174.67175.49173.94174.5220810384
92019-03-07AAPL173.87174.44172.02172.5024796374
\n", "
" ], "text/plain": [ " date symbol open high low close volume\n", "0 2019-03-01 AMZN 1655.13 1674.26 1651.00 1671.73 4974877\n", "1 2019-03-04 AMZN 1685.00 1709.43 1674.36 1696.17 6167358\n", "2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522\n", "3 2019-03-06 AMZN 1695.97 1697.75 1668.28 1668.95 3996001\n", "4 2019-03-07 AMZN 1667.37 1669.75 1620.51 1625.95 4957017\n", "5 2019-03-01 AAPL 174.28 175.15 172.89 174.97 25886167\n", "6 2019-03-04 AAPL 175.69 177.75 173.97 175.85 27436203\n", "7 2019-03-05 AAPL 175.94 176.00 174.54 175.53 19737419\n", "8 2019-03-06 AAPL 174.67 175.49 173.94 174.52 20810384\n", "9 2019-03-07 AAPL 173.87 174.44 172.02 172.50 24796374" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv')\n", "stocks.head(10)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date2019-03-012019-03-042019-03-052019-03-062019-03-07
symbol
AAPL2588616727436203197374192081038424796374
AMZN49748776167358368152239960014957017
GOOG14503161446047144317410992891166559
\n", "
" ], "text/plain": [ "date 2019-03-01 2019-03-04 2019-03-05 2019-03-06 2019-03-07\n", "symbol \n", "AAPL 25886167 27436203 19737419 20810384 24796374\n", "AMZN 4974877 6167358 3681522 3996001 4957017\n", "GOOG 1450316 1446047 1443174 1099289 1166559" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks.pivot(index='symbol', columns='date', values='volume')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Melt\n", "\n", "\n", "\n", "__[pandas.DataFrame.melt](https://pandas.pydata.org/pandas-docs/version/0.22.0/generated/pandas.DataFrame.melt.html)__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`melt()` hace lo opuesto a `pivot()`" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indextest_1test_2test_3
0Mary645
1John787
2Ann679
3Pete655
4Laura527
\n", "
" ], "text/plain": [ " index test_1 test_2 test_3\n", "0 Mary 6 4 5\n", "1 John 7 8 7\n", "2 Ann 6 7 9\n", "3 Pete 6 5 5\n", "4 Laura 5 2 7" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades = pd.DataFrame([[6, 4, 5], [7, 8, 7], [6, 7, 9], [6, 5, 5], [5, 2, 7]], \n", " index = ['Mary', 'John', 'Ann', 'Pete', 'Laura'],\n", " columns = ['test_1', 'test_2', 'test_3'])\n", "grades.reset_index(inplace=True)\n", "grades" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexvariablevalue
0Marytest_16
1Johntest_17
2Anntest_16
3Petetest_16
4Lauratest_15
5Marytest_24
6Johntest_28
7Anntest_27
8Petetest_25
9Lauratest_22
10Marytest_35
11Johntest_37
12Anntest_39
13Petetest_35
14Lauratest_37
\n", "
" ], "text/plain": [ " index variable value\n", "0 Mary test_1 6\n", "1 John test_1 7\n", "2 Ann test_1 6\n", "3 Pete test_1 6\n", "4 Laura test_1 5\n", "5 Mary test_2 4\n", "6 John test_2 8\n", "7 Ann test_2 7\n", "8 Pete test_2 5\n", "9 Laura test_2 2\n", "10 Mary test_3 5\n", "11 John test_3 7\n", "12 Ann test_3 9\n", "13 Pete test_3 5\n", "14 Laura test_3 7" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades.melt(id_vars=['index']) # indicar las columna que identifican a cada entidad" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }